This dataset describes cardiovascular disease in patients along several dimensions. It consists of 70,000 patient records with 12 features, such as age, gender, systolic blood pressure and diastolic blood pressure. The target class "cardio" equals 1 when the patient has cardiovascular disease and 0 when the patient does not. All values were collected at the time of the medical examination.
There are 12 fields related to cardiovascular disease. We try to find which of them are associated with cardiovascular disease across the sample. The stakeholders for this study are patients, for their well-being and healthy life, and pharmaceutical companies, to help them in drug manufacturing.
We are the founder and CEO of a US cardiovascular disease center, and we want to dedicate our research to giving back to the community and to the well-being of patients. Cardiovascular diseases are very dangerous if not diagnosed in time, and without proper care after diagnosis they can become life-threatening. Our organization will therefore help the public estimate the probability of having heart disease from a patient's features. Our report could include the number of patients for each cardiovascular-disease-related feature. We plan to study this by grouping patients into "With" and "Without" heart disease and analysing the features that are most strongly correlated with heart disease in the given data.
To understand and explore the dataset, read the cardio_train.csv file using pandas and display the first 5 rows of the data.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
cardio_df = pd.read_csv("C:/Users/kanum/Desktop/Usha/UOP/ANLT/Python Project/cardiovascular-disease-dataset/cardio_train.csv", sep=";")
cardio_df.head()
| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
Observations from output: The dataset holds patient details: ID, age (in days), gender (1: women, 2: men), height (in cm), weight (in kg), systolic blood pressure (ap_hi), diastolic blood pressure (ap_lo), cholesterol group (1: normal, 2: above normal, 3: well above normal), glucose level (1: normal, 2: above normal, 3: well above normal), smoking, alcohol intake and physical activity (0: no, 1: yes), and the target variable cardio (0: no, 1: yes).
print("Number of rows in dataset {}".format(cardio_df.shape[0]))
print("Number of columns in dataset {}".format(cardio_df.shape[1]))
Number of rows in dataset 70000
Number of columns in dataset 13
Observations from output: The dataset has a total of 70000 rows and 13 columns.
cardio_df.describe()
| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 |
| mean | 49972.419900 | 19468.865814 | 1.349571 | 164.359229 | 74.205690 | 128.817286 | 96.630414 | 1.366871 | 1.226457 | 0.088129 | 0.053771 | 0.803729 | 0.499700 |
| std | 28851.302323 | 2467.251667 | 0.476838 | 8.210126 | 14.395757 | 154.011419 | 188.472530 | 0.680250 | 0.572270 | 0.283484 | 0.225568 | 0.397179 | 0.500003 |
| min | 0.000000 | 10798.000000 | 1.000000 | 55.000000 | 10.000000 | -150.000000 | -70.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 25006.750000 | 17664.000000 | 1.000000 | 159.000000 | 65.000000 | 120.000000 | 80.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 50% | 50001.500000 | 19703.000000 | 1.000000 | 165.000000 | 72.000000 | 120.000000 | 80.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 75% | 74889.250000 | 21327.000000 | 2.000000 | 170.000000 | 82.000000 | 140.000000 | 90.000000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| max | 99999.000000 | 23713.000000 | 2.000000 | 250.000000 | 200.000000 | 16020.000000 | 11000.000000 | 3.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
Observations from output: The average age of the patients in the sample is 19468 days (about 53 years), the average height is 164.36 cm and the average weight is 74.2 kg. The means of cholesterol and glucose are 1.37 and 1.23 respectively, close to 1 ("Normal"), indicating that most patients in the sample have normal cholesterol and glucose levels. The means of smoke and alco are well below 0.5, so most patients neither smoke nor drink alcohol. The mean of active is greater than 0.5, so most patients are physically active. The mean of cardio is 0.4997, so about 50% of the patients in the sample have heart disease. ap_hi and ap_lo contain abnormally high and low (even negative) values, which must be handled while cleaning the data.
cardio_df.dtypes
id               int64
age              int64
gender           int64
height           int64
weight         float64
ap_hi            int64
ap_lo            int64
cholesterol      int64
gluc             int64
smoke            int64
alco             int64
active           int64
cardio           int64
dtype: object
Observations from output: All columns in the dataset are numerical. All are of type int64 except weight, which is float64.
As part of data cleaning, check for rows with NA values.
print("Number of NA's in dataset {}".format(len(cardio_df[(cardio_df.isna().sum(axis=1))>0])))
Number of NA's in dataset 0
Observations from output: There are no rows with NA values in the dataset.
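A per-column count can complement the row-level check, because it shows where missing values sit. The sketch below uses a small made-up frame (`toy_df` is a hypothetical stand-in, not the cardio data) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for cardio_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "ap_hi": [110, 140, np.nan],
    "ap_lo": [80, 90, 70],
})

# Number of missing values in each column
na_per_column = toy_df.isna().sum()

# Number of rows that contain at least one missing value
rows_with_na = int(toy_df.isna().any(axis=1).sum())

print(na_per_column)
print("Rows with NA:", rows_with_na)
```

On the real data the same two expressions can be applied to `cardio_df` directly.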
# removing rows with highly extreme and low values of systolic bp
cardio_df = cardio_df[(cardio_df["ap_hi"]>30) & (cardio_df["ap_hi"]<250)]
print("Number of rows in dataset {}".format(cardio_df.shape[0]))
Number of rows in dataset 69772
Observations from output: After removing rows with abnormally high and low(even negative) values of systolic blood pressure 69772 rows remained in the dataset.
# removing rows with highly extreme and low values of diastolic bp
cardio_df = cardio_df[(cardio_df["ap_lo"]>30) & (cardio_df["ap_lo"]<150)]
print("Number of rows in dataset {}".format(cardio_df.shape[0]))
Number of rows in dataset 68747
Observations from output: After removing rows with abnormally high and low(even negative) values of diastolic blood pressure 68747 rows are remaining for the data analysis.
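The two blood pressure filters can also be wrapped in one helper, which makes the thresholds explicit and reports how many rows were dropped. This is only a sketch: `filter_bp` and the toy values are assumptions introduced for illustration, while the ranges are the ones used above.

```python
import pandas as pd

def filter_bp(df, hi_range=(30, 250), lo_range=(30, 150)):
    """Keep rows whose ap_hi and ap_lo lie strictly inside the given ranges."""
    hi_ok = (df["ap_hi"] > hi_range[0]) & (df["ap_hi"] < hi_range[1])
    lo_ok = (df["ap_lo"] > lo_range[0]) & (df["ap_lo"] < lo_range[1])
    # copy() returns an independent frame, so new columns can be added later
    return df[hi_ok & lo_ok].copy()

# Made-up rows: only the first one has plausible values in both columns
toy_df = pd.DataFrame({
    "ap_hi": [110, -150, 16020, 120],
    "ap_lo": [80, 70, 90, 11000],
})

cleaned = filter_bp(toy_df)
removed = len(toy_df) - len(cleaned)
print("Rows kept:", len(cleaned), "Rows removed:", removed)
```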
cardio_df["age_years"] = round((cardio_df["age"]/365.25),0)
cardio_df.head()
| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | age_years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 | 50.0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 | 55.0 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 | 52.0 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 | 48.0 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 | 48.0 |
Observations from output: A new column age_years is created holding values of age in years. This column can be used to find if people tend to have higher risk of heart disease with age.
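As a quick check of the conversion, the first three ages from the table can be converted by hand. The snippet is a small standalone sketch that repeats the same arithmetic (days divided by 365.25, rounded to the nearest year).

```python
import pandas as pd

# First three values of the age column (in days) from the table above
age_days = pd.Series([18393, 20228, 18857])

# 365.25 days per year accounts for leap years on average
age_years = (age_days / 365.25).round(0)

print(age_years.tolist())  # [50.0, 55.0, 52.0]
```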
cardio_df["bmi"] = cardio_df["weight"]/((cardio_df['height']/100)**2)
cardio_df.head()
| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | age_years | bmi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 | 50.0 | 21.967120 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 | 55.0 | 34.927679 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 | 52.0 | 23.507805 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 | 48.0 | 28.710479 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 | 48.0 | 23.011177 |
Observations from output: A new column bmi is created by using weight(kg)/(height(m)* height(m)) of each patient. This column can be used to find if people with higher BMI are at higher risk of heart disease.
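The BMI formula can be verified on the first two patients of the table. The sketch below repeats the calculation on those values only; height is converted from centimetres to metres before squaring.

```python
import pandas as pd

# Weight (kg) and height (cm) of the first two patients in the table above
sample = pd.DataFrame({"weight": [62.0, 85.0], "height": [168, 156]})

# BMI = weight in kg divided by the square of height in metres
sample["bmi"] = sample["weight"] / ((sample["height"] / 100) ** 2)

print(sample["bmi"].round(2).tolist())  # [21.97, 34.93]
```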
#bar: Gender vs count of Cardio
ax = cardio_df.groupby(['gender','cardio'])['gender'].count().unstack(0).plot.bar(title="Number of heart diseases by Gender", figsize=(12,4))
_ = ax.set_xlabel('cardio')
_ = ax.set_ylabel('Frequency')
ax.set_xticklabels(["No","Yes"])
ax.legend(["Female","Male"])
Observation from bar chart: In the group without heart disease, women outnumber men, which could suggest that women are healthier than the male participants. However, women also outnumber men in the group with heart disease. The likely reason for this apparent contradiction is that the dataset contains more women than men, so raw counts alone do not show the effect of gender.
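One way to separate the effect of gender from the size of each group is to compare proportions instead of counts. The sketch below uses made-up rows (`toy` is hypothetical) in which women are twice as numerous as men; `pd.crosstab` with `normalize="index"` then gives the share of each cardio class within each gender.

```python
import pandas as pd

# Made-up sample: four women (gender 1) and two men (gender 2)
toy = pd.DataFrame({
    "gender": [1, 1, 1, 1, 2, 2],
    "cardio": [0, 1, 1, 0, 0, 1],
})

# Raw counts: women are higher in both cardio classes
counts = pd.crosstab(toy["gender"], toy["cardio"])

# Row-wise proportions: share of cardio 0 and 1 within each gender
rate_by_gender = pd.crosstab(toy["gender"], toy["cardio"], normalize="index")

print(counts)
print(rate_by_gender)
```

In this toy sample the counts differ by gender while the proportions are identical, which is the situation the observation above describes as a possibility for the real data.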
#hist: age_years
plt.figure(figsize=(10,6))
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(cardio_df['age_years'], color='red')
Observation from histogram: Patients aged 50-60 years are the largest group in the dataset. The data was probably collected this way to check whether people in this age group have a higher chance of cardiovascular disease.
#pie chart for cholesterol and cardio
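chol_labels = {1:"Normal", 2:"Above Normal", 3:"Well Above Normal"}  # avoid shadowing the built-in dict
# sort_index() puts the counts in the order 1, 2, 3 so that they line up with the labels
cardio_df["cholesterol"].value_counts().sort_index().plot.pie(figsize=(12,8), title="Number of people in cholesterol groups", labels=list(chol_labels.values()), autopct='%1.1f%%')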
Observations from Pie-Chart: Most participants have a normal cholesterol level: 75% are normal, while 13.5% are above normal and 11.5% are well above normal.
#cardio_df['bmi'] = cardio_df['weight']/((cardio_df['height']/100)**2)
ax = sns.catplot(x="gender", y="bmi", hue="alco", col="cardio", data=cardio_df,color = "yellow",kind="box", height=10, aspect=.7);
ax.set_xticklabels(["Women","Men"])
new_title = 'Alcoholic'
ax._legend.set_title(new_title)
# replace labels
new_labels = ['No', 'Yes']
for t, l in zip(ax._legend.texts, new_labels): t.set_text(l)
Observations from Box-plot: In the group with cardiovascular disease (right), women who consume alcohol appear to be at higher risk than men who consume alcohol, based on their BMI. Women who consume alcohol also appear more prone to heart disease than men who do not. In the group without cardiovascular disease (left), women appear healthier than men.
corr = cardio_df.iloc[:,2:].corr()
fig, ax = plt.subplots(figsize=(12,10))
ax = sns.heatmap(corr, linewidths=.8, annot=True)
plt.show()
Observations from Heatmap: The target variable (cardio) correlates more strongly with systolic blood pressure (ap_hi), diastolic blood pressure (ap_lo), age and cholesterol than with the other features.
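Reading the strongest correlations off a heatmap can be error-prone, so it may help to rank the features by their absolute correlation with the target. The sketch below does this on a made-up frame (`toy` and its values are assumptions for illustration), and the same expression can be applied to the correlation matrix of the real data.

```python
import pandas as pd

# Made-up sample in which ap_hi tracks cardio closely and smoke does not
toy = pd.DataFrame({
    "ap_hi": [110, 120, 140, 150, 160, 130],
    "smoke": [0, 1, 0, 1, 0, 1],
    "cardio": [0, 0, 1, 1, 1, 0],
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["cardio"]
    .drop("cardio")          # the target correlates perfectly with itself
    .abs()
    .sort_values(ascending=False)
)

print(target_corr)
```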
Create a dataframe using patient id, ap_hi, ap_lo, cholesterol and target variable(cardio) that can be used for hierarchical clustering. Since the dataset is huge with 68K rows, group patients on ap_hi and ap_lo to create clusters.
dend_df = cardio_df[["id","ap_hi","ap_lo","cholesterol","cardio"]]
dend_df = dend_df.set_index(dend_df["id"])
dend_df = dend_df[["ap_hi","ap_lo","cardio"]]
dend_df.groupby(["ap_hi","ap_lo"]).mean().shape
(981, 1)
Observation from output: There are 981 groups of patients as per systolic and diastolic blood pressure measures.
from scipy.spatial import distance
from scipy.cluster.hierarchy import linkage, dendrogram
cardio_distance_df = pd.DataFrame(distance.cdist(dend_df.groupby(["ap_hi","ap_lo"]).mean(), dend_df.groupby(["ap_hi","ap_lo"]).mean(), 'euclidean'))
cardio_distance_df.astype(int).head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 971 | 972 | 973 | 974 | 975 | 976 | 977 | 978 | 979 | 980 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 981 columns
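Observations from output: The Euclidean distances between the 981 patient groups are calculated as shown above. Because ap_hi and ap_lo become the index after grouping, the distance is computed only on the mean of cardio, which lies between 0 and 1. After casting to int, every distance is therefore either 0 or 1.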
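The effect of the integer cast is easy to see on a tiny example. The sketch below uses three hypothetical group means of cardio. With a single column, the Euclidean distance reduces to the absolute difference, so all distances stay within [0, 1] and only a distance of exactly 1 survives the cast.

```python
import numpy as np

# Hypothetical mean cardio values of three patient groups (one column)
mean_cardio = np.array([[0.0], [0.25], [1.0]])

# Pairwise Euclidean distance in one dimension = absolute difference
dist = np.abs(mean_cardio - mean_cardio.T)

# Casting to int truncates everything below 1 to 0
dist_int = dist.astype(int)

print(dist)
print(dist_int)
```

Displaying the unrounded distances (or rounding to two decimals) would keep more of the information than the integer cast.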
import plotly.figure_factory as ff
import chart_studio.plotly
from plotly.figure_factory import create_dendrogram
fig = ff.create_dendrogram(dend_df.groupby(["ap_hi","ap_lo"]).mean(), color_threshold=1.5)
#fig = ff.create_dendrogram(Z, color_threshold=1.5)
#fig = ff.create_dendrogram(dend_df[["ap_hi","ap_lo","cardio"]], color_threshold=1.5)
fig.update_layout(width=800, height=500)
fig.show()
Observation: The plot shows the possibility of having 2 clusters.
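grouped = dend_df.groupby(["ap_hi","ap_lo"]).mean()
Z = linkage(grouped, 'average')
plt.figure(figsize=(25, 10))
# one label per grouped row (ap_hi/ap_lo pair); the number of labels must match the number of leaves
D = dendrogram(Z=Z, leaf_font_size=9, labels=[f"{hi}/{lo}" for hi, lo in grouped.index])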
Observations from dendrogram: The dendrogram shows that there are 2 groups of clusters after grouping by blood pressure measures.
Create a function, kmeans_fun, which takes the dataframe (df), the number of clusters (k), the number of dimensions (num_dim) and the number of iterations (num_iter) as arguments. The function creates k clusters from the first num_dim columns of df. The flow of the function:
1) Create k random centroids.
2) Compute the Euclidean distance from every row to each centroid.
3) Assign each row to its nearest centroid (Assoc).
4) Move each centroid to the mean of the rows assigned to it.
5) Repeat steps 2 to 4 until the centroids stop moving or num_iter iterations are done.
The function returns the dataframe with each row assigned to one of the clusters (Assoc) and the latest centroid coordinates.
def kmeans_fun(df,k,num_dim,num_iter):
    #Create k number of random centroids
    import random
    #random.seed(180)
    cent_lis = []
    for c in range(0,k):
        #cent_lis.append([round(random.randint(0,2),2), round(random.randint(0,3),2), round(random.randint(0,1),2)])
        cent_lis.append([round(random.randint(30,250),2), round(random.randint(30,150),2), round(random.randint(0,1),2)])
        df["Dist_C" + str(c+1)] = 0.0
    cent = np.array(cent_lis).astype('float64')
    print("Random centroids created: ", cent)
    print(" ")
    #Number of iterations to be performed to re-center centroids
    for i in range(0,num_iter):
        for c in range(0,k):
            #Reset the distance so that it does not accumulate across iterations
            df["Dist_C" + str(c+1)] = 0.0
            #Calculate the distance of points to each of the centroids using euclidean formula
            for n in range(0,num_dim):
                df["Dist_C" + str(c+1)] = df["Dist_C" + str(c+1)] + (df.iloc[:,n] - cent[c][n])**2
            df["Dist_C" + str(c+1)] = np.sqrt(df["Dist_C" + str(c+1)])
        #Assign the point to cluster based on distances
        #(the last character of the column name is the cluster number, so this works for k < 10)
        df["Assoc"] = df[df.columns[num_dim:num_dim+k]].idxmin(axis=1).str.strip().str[-1]
        df.Assoc = df.Assoc.astype('int64')
        prev_cent = np.array(cent)
        g = df.groupby("Assoc")[df.columns[0:num_dim]].mean()
        #Recentering centroids
        for c in range(0,k):
            if c+1 in g.index:
                cent[c] = np.array([round(g.loc[c+1,:],2)]).astype('float64')
        #Exit the loop if position of the centroids do not change
        if (np.array_equal(prev_cent,cent)):
            print("Number of iterations: ", i)
            print(" ")
            break
    return df,np.array(cent)
Call the function created to cluster data on new dataframe with 3 dimensions(num_dim), 2 clusters(k), 10 iterations(num_iter):
df_kmean = pd.DataFrame(cardio_df[["ap_hi","ap_lo","cardio"]])
df_kmean, cent_fin =kmeans_fun(df_kmean,2,3,10)
print("Optimal centroids: ", cent_fin)
print(df_kmean.groupby("Assoc").mean())
Random centroids created: [[ 33. 141. 0.]
[ 77. 137. 0.]]
Number of iterations: 3
Optimal centroids: [[ 82.38 122.56 0.56]
[126.67 81.3 0.49]]
ap_hi ap_lo cardio Dist_C1 Dist_C2
Assoc
1 82.378049 122.560976 0.560976 13.03798 62.200935
2 126.666715 81.298886 0.494677 63.22374 15.809170
Observations from output: The initial random centroids, the number of iterations needed to reach stable centroids, the mean of each cluster and the final centroids are shown above. Although we allowed 10 iterations in the function, the centroids stopped changing well before that.
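For comparison, a single k-means iteration can be written without per-column loops by using NumPy broadcasting. This is a minimal sketch, not the function used above: `kmeans_step`, the toy points and the starting centroids are all assumptions for illustration, and the labels are 0-based, unlike the 1-based Assoc column.

```python
import numpy as np

def kmeans_step(points, centroids):
    """One k-means iteration: assign points to the nearest centroid, then recompute centroids."""
    # Pairwise Euclidean distances, shape (n_points, k)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = centroids.copy()
    for c in range(len(centroids)):
        members = points[labels == c]
        if len(members) > 0:  # keep the old centroid if a cluster is empty
            new_centroids[c] = members.mean(axis=0)
    return labels, new_centroids

# Made-up (ap_hi, ap_lo) pairs and two starting centroids
points = np.array([[110.0, 70.0], [120.0, 80.0], [160.0, 100.0], [170.0, 110.0]])
start = np.array([[100.0, 60.0], [180.0, 120.0]])

labels, centroids = kmeans_step(points, start)
print(labels)      # [0 0 1 1]
print(centroids)   # centroids move to the mean of their members
```

Calling `kmeans_step` repeatedly until the centroids stop changing gives the same stopping rule as the function above.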
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
colormap = np.array(['red', 'blue'])
model = KMeans(n_clusters = 2)
model.fit(df_kmean.iloc[:,0:3])
model.labels_
df_kmean.groupby(colormap[model.labels_]).mean()
#df_fun.groupby(colormap[model.labels_]).count()
| | ap_hi | ap_lo | cardio | Dist_C1 | Dist_C2 | Assoc |
|---|---|---|---|---|---|---|
| blue | 145.781692 | 90.465465 | 0.803352 | 72.279176 | 23.156116 | 2.000000 |
| red | 117.439777 | 76.984342 | 0.347056 | 58.801108 | 12.374588 | 1.998236 |
Observations from output: The KMeans algorithm creates 2 clusters. One cluster has clearly higher mean ap_hi and ap_lo than the other, and its mean cardio is also higher (0.80 versus 0.35). Patients in the cluster with higher blood pressure are therefore at higher risk of heart disease than patients in the other cluster.
# View the results
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches itself to the figure in recent matplotlib
# Create a colormap
colormap = np.array(['red', 'blue'])
ax.scatter(df_kmean.ap_hi, df_kmean.ap_lo, df_kmean.cardio, c=colormap[model.labels_], s = 10)
plt.title('K Means Clustering')
ax.set_xlabel('ap_hi')
ax.set_ylabel('ap_lo')
ax.set_zlabel('cardio')
plt.show()
Observations from plot: KMeans from sklearn has split the patients into 2 clusters, shown in red and blue. The graph suggests that the risk of cardiac disease increases as the blood pressure values rise.
The function below takes the number of nearest neighbours, the data point to predict and the dataframe on which the KNN algorithm is fitted.
from sklearn.neighbors import KNeighborsClassifier
def knn_func(k,data_point,df):
    knn = KNeighborsClassifier(n_neighbors = k, p = 2)# p=2 for euclidean distance
    knn.fit(df[df.columns[0:3]],df[df.columns[3]])
    #knn.fit(df[df.columns[0:2]],df[df.columns[2]])
    class_name = ["Red", "Blue"]
    data_class = knn.predict(data_point.reshape(1, -1))[0]
    print('Prediction: cluster ', data_class, class_name[data_class])
    return data_class, class_name[data_class]
The function is successfully created which can then be used for fitting and predicting cluster. To call the function, 3 dimensions, ap_hi, ap_lo and cholesterol are used to predict cardiac risks.
df_knn = cardio_df[["ap_hi","ap_lo","cholesterol","cardio"]]
#df_knn = cardio_df[["ap_hi","ap_lo","cardio"]]
for i in range(0,5):
    ap_hi = int(input('Systolic blood pressure ap_hi: '))
    ap_lo = int(input('Diastolic blood pressure ap_lo: '))
    cholesterol = int(input('cholesterol from 1 to 3: '))
    knn_func(3,np.array([ap_hi, ap_lo, cholesterol]),df_knn)
    #knn_func(3,np.array([ap_hi, ap_lo]),df_knn)
    #data_class.groupby(class_name[data_class]).mean()
Systolic blood pressure ap_hi: 110
Diastolic blood pressure ap_lo: 70
cholesterol from 1 to 3: 3
Prediction: cluster 0 Red
Systolic blood pressure ap_hi: 110
Diastolic blood pressure ap_lo: 70
cholesterol from 1 to 3: 1
Prediction: cluster 1 Blue
Systolic blood pressure ap_hi: 180
Diastolic blood pressure ap_lo: 100
cholesterol from 1 to 3: 3
Prediction: cluster 0 Red
Systolic blood pressure ap_hi: 180
Diastolic blood pressure ap_lo: 100
cholesterol from 1 to 3: 1
Prediction: cluster 1 Blue
Systolic blood pressure ap_hi: 130
Diastolic blood pressure ap_lo: 80
cholesterol from 1 to 3: 2
Prediction: cluster 1 Blue
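Observations from output: The predicted class changes when only the cholesterol level changes (for example 110/70 with cholesterol 3 versus cholesterol 1), so cholesterol influences the KNN prediction along with the systolic and diastolic blood pressure measures. In these five runs, however, the inputs with cholesterol 3 were predicted as class 0 (Red, no heart disease) and the inputs with cholesterol 1 as class 1 (Blue), so this small sample does not by itself show that higher cholesterol raises the predicted risk. With only 3 neighbours and many patients sharing identical values, single predictions can be unstable. The expectation that higher blood pressure or cholesterol goes with a higher risk of cardiac disease rests on the heatmap and the clustering results rather than on these predictions.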
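To see why a single KNN prediction comes out the way it does, it can help to look at the neighbours that vote. The sketch below is a hand-rolled majority vote in NumPy on made-up rows. `knn_predict` and the toy data are assumptions for illustration and not the sklearn implementation used above.

```python
from collections import Counter

import numpy as np

def knn_predict(X, y, query, k=3):
    """Return the majority class among the k nearest rows and their indices."""
    dists = np.linalg.norm(X - query, axis=1)      # Euclidean distance to every row
    nearest = np.argsort(dists, kind="stable")[:k] # ties keep the original row order
    votes = Counter(y[nearest].tolist())
    return votes.most_common(1)[0][0], nearest

# Made-up rows: ap_hi, ap_lo, cholesterol; labels are cardio (0 or 1)
X = np.array([
    [110, 70, 1], [112, 72, 1], [115, 75, 1],
    [180, 100, 3], [178, 98, 3], [175, 105, 2],
], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

pred, nearest = knn_predict(X, y, np.array([179.0, 99.0, 3.0]))
print("Predicted class:", pred)
print("Voting rows:", nearest.tolist())
```

When several rows are equally distant from the query, which of them end up among the k neighbours depends on row order, which is one reason individual predictions with a small k can flip.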
## View the results
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
# Create a colormap
colormap = np.array(['red', 'blue'])
# plot cholesterol on the z axis so that it matches the z label set below
ax.scatter(df_knn.ap_hi, df_knn.ap_lo, df_knn.cholesterol, c = colormap[df_knn.cardio], s = 10)
#plt.title('KNN Classification')
plt.title('Cardio:\nNo - Red\nYes - Blue')
ax.set_xlabel('ap_hi')
ax.set_ylabel('ap_lo')
ax.set_zlabel('cholesterol')
plt.show()
Observations from the plot: Patient groups with lower values on ap_hi and ap_lo along with cholesterol levels have a healthy heart as compared to the patients with higher values on all 3 variables.
We are the founder and CEO of a US cardiovascular disease center, and we want to dedicate our research to giving back to the community and to the well-being of patients. Cardiovascular diseases are very dangerous if not diagnosed in time, and without proper care after diagnosis they can become life-threatening. Our organization will therefore help the public estimate the probability of having heart disease from a patient's features. Our report could include the number of patients for each cardiovascular-disease-related feature. We plan to study this by grouping patients into "With" and "Without" heart disease and analysing the features that are most strongly correlated with heart disease in the given data.
For this study we use a dataset that describes cardiovascular disease in patients along several dimensions. It consists of 70,000 patient records with 12 features, such as age, gender, systolic blood pressure and diastolic blood pressure. The target class "cardio" equals 1 when the patient has cardiovascular disease and 0 when the patient does not. All values were collected at the time of the medical examination.
The average age of the patients in the sample is 19468 days (about 53 years), the average height is 164.36 cm and the average weight is 74.2 kg. The means of cholesterol and glucose are 1.37 and 1.23 respectively, close to 1 ("Normal"), indicating that most patients in the sample have normal cholesterol and glucose levels. The means of smoke and alco are well below 0.5, so most patients neither smoke nor drink alcohol. The mean of active is greater than 0.5, so most patients are physically active. The mean of cardio is 0.4997, so about 50% of the patients in the sample have heart disease. ap_hi and ap_lo contain abnormally high and low (even negative) values, which must be handled while cleaning the data.
There are no rows with NA values in the dataset. However, since the ap_hi and ap_lo columns had abnormal values, we keep only the rows with ap_hi between 30 and 250 and ap_lo between 30 and 150. After this cleaning, 68747 rows are left, which can be used for data analysis and clustering.
The bar chart shows that women outnumber men among patients without heart disease, which could suggest that women are healthier than the male participants. However, women also outnumber men among patients with heart disease. Hence predicting cardio from gender alone might not be accurate.
The pie chart shows that the cholesterol level is normal for most participants: 75% are normal, while 13.5% are above normal and 11.5% are well above normal.
The box plot suggests that in the group with cardiovascular disease (right), women who consume alcohol are at higher risk than men who consume alcohol, based on their BMI. Women who consume alcohol also appear more prone to heart disease than men who do not. In the group without cardiovascular disease (left), women appear healthier than men.
The heatmap shows that the target variable (cardio) correlates more strongly with systolic blood pressure (ap_hi), diastolic blood pressure (ap_lo), age and cholesterol than with the other features.
Plotting a dendrogram for the grouped data (on ap_hi and ap_lo):
The dendrogram shows that there can be 2 clusters when patients are grouped by ap_hi and ap_lo.
Creating clusters using the KMeans algorithm:
The KMeans algorithm creates 2 clusters. One cluster has clearly higher mean ap_hi and ap_lo than the other. Patients in the cluster with higher blood pressure are at higher risk of heart disease than patients in the other cluster.
KMeans from sklearn has split the patients into 2 clusters, shown in red and blue in the plot. The graph suggests that the risk of cardiac disease increases as the blood pressure values rise.
The 3 dimensions (ap_hi, ap_lo and cholesterol) are used to fit KNN and predict whether a patient has heart disease. Results: Patient groups with lower values of ap_hi, ap_lo and cholesterol have a healthier heart than patients with higher values on all 3 variables.
1) Cardio (whether a person has heart disease or not) correlates more strongly with systolic blood pressure (ap_hi), diastolic blood pressure (ap_lo), age and cholesterol than with the other features.
2) People with high blood pressure are at higher risk of cardiac disease.
3) People with high cholesterol have a higher probability of being affected by cardiovascular disease.
4) Women who consume alcohol appear to be at higher risk of heart disease than men who consume alcohol, based on their BMI.